@Mateus-Cordeiro Mateus-Cordeiro commented Feb 10, 2026

This PR adds structured progress feedback to the build command, usable both from the CLI and from SDK callers.

Changes

Progress API

  • Introduces a lightweight ProgressCallback and a small set of ProgressEvent kinds to report:
    • build start/finish
    • datasource start/finish
    • datasource progress
  • Uses a ProgressEmitter that wraps the optional callback and emits these events

Build pipeline wiring

  • Threads an optional ProgressCallback through the build entrypoints.
  • Emits lifecycle events during execution.

CLI progress rendering (Rich)

  • Adds a rich_progress() context manager that renders:
    • an overall Datasources i/n progress bar
    • a per datasource progress bar
    • per datasource completion lines + final summary message
  • Disabled when not running in an interactive terminal.
  • Rich is an optional cli extra; rendering falls back to a no-op if rich isn't installed.

Screen.Recording.2026-02-10.at.11.06.41.mov

for context in contexts:
emitter = ProgressEmitter(progress)

if not contexts:
Collaborator

I think this if-block is not needed. It will behave the exact same way without it.

Collaborator Author

Good point. I'll remove it.

@dataclass(frozen=True, slots=True)
class ProgressEvent:
    kind: ProgressKind
    datasource_id: str | None = None
Collaborator

It seems datasource_id is never null here. Or am I missing something?

Collaborator Author

Currently, not all emitted events are datasource-scoped. Events such as TASK_STARTED and TASK_FINISHED do not use a datasource_id, so it's optional here. We could define a separate ProgressEvent type for each ProgressKind, so that each kind of event would have a more precise structure, but that might be overkill at this point.

_set_datasource_percent(float(pct))
return

root = logging.getLogger()
Collaborator

@hsestupin hsestupin Feb 11, 2026

Why is this logger manipulation needed? It looks sketchy.

Collaborator Author

This code mainly does one thing: it silences all logging globally. It is only activated when we are running in an interactive terminal and the rich progress bar is shown.
This was done because "normal" logging badly disrupts the rich progress display in interactive mode.

There are alternatives, such as routing logs via a RichHandler when in interactive mode.

Collaborator Author

In the new version of the PR, this is simplified by routing the logs via a RichHandler, and this bottom part of the code looks less sketchy too.
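One way to keep this kind of logger manipulation contained is a context manager that swaps the root logger's handlers and always restores them. The sketch below uses only the standard library; route_root_logging is a hypothetical helper name, not something from the PR. In interactive mode the handler passed in could be a rich.logging.RichHandler bound to the same Console as the progress bar, though how the PR wires that up is not shown in this thread.

```python
import logging
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def route_root_logging(handler: logging.Handler) -> Iterator[None]:
    """Temporarily send all root-logger output through `handler` only."""
    root = logging.getLogger()
    saved = root.handlers[:]  # copy, since we mutate the list below
    for existing in saved:
        root.removeHandler(existing)
    root.addHandler(handler)
    try:
        yield
    finally:
        # Restore the original handlers even if the build raises.
        root.removeHandler(handler)
        for existing in saved:
            root.addHandler(existing)
```

Compared with silencing logging globally, this keeps log records visible while the bar is rendered and leaves the logging configuration unchanged afterwards.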

)
)
if i % EMIT_EVERY == 0 or i == len(chunks):
    total_units = len(chunks) * 2
Collaborator

I cannot grasp the logic with len(chunks) multiplied by 2. Can you explain please?

Collaborator

Ah OK, I see. For every embedding there are 2 steps: get the embedding and persist it.

Collaborator

@hsestupin hsestupin Feb 11, 2026

So the problem is that plugin execution is effectively a single big step in the progress bar. Did you skip this part intentionally, or leave it for later?

I think we need some DatasourceEmitter object that knows everything about a datasource's progress status. Another problem at the moment is that the total_units logic is split between 2 different methods. They need to know about each other, which makes them tightly coupled.

Collaborator Author

Indeed, the *2 exists because 2 operations are performed for each chunk. Regarding plugin execution, I decided not to touch the plugins themselves and to treat introspection as an "atomic" step that doesn't report intermediate progress.

If we feel like this would be a significant win, we can definitely pass an emitter to the plugin and have the plugin emit events during the normal execution too. Let me know what you think.
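As a rough illustration of the DatasourceEmitter idea raised in this thread, the sketch below keeps the "2 units per chunk" rule in one object so that the embedding and persistence code paths only report that a unit finished. Everything here is hypothetical: the class shape, the UNITS_PER_CHUNK constant, and the report callback signature are assumptions, not code from the PR.

```python
from dataclasses import dataclass
from typing import Callable

# Each chunk is embedded and then persisted, hence 2 units per chunk.
UNITS_PER_CHUNK = 2


@dataclass
class DatasourceEmitter:
    datasource_id: str
    chunk_count: int
    # Called as report(datasource_id, completed_units, total_units).
    report: Callable[[str, int, int], None]
    completed_units: int = 0

    @property
    def total_units(self) -> int:
        # Single place that knows how chunks map to progress units.
        return self.chunk_count * UNITS_PER_CHUNK

    def unit_done(self) -> None:
        """Record one finished operation (an embed or a persist)."""
        self.completed_units += 1
        self.report(self.datasource_id, self.completed_units, self.total_units)
```

With this shape, neither method has to compute total_units itself, which removes the coupling described above. If plugins later emit their own progress, the same object could be handed to them.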

total_units=total_units,
)

emitter.datasource_progress_units(
Collaborator

Btw, this event has already been emitted at line 118, because the if condition contains or i == len(chunks).

Collaborator Author

Thank you! Initially I didn't have the or i == len(chunks) in the condition, so this was a leftover from then. I'll remove it.
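The throttling condition discussed here can be checked in isolation. The sketch below assumes i is a 1-based chunk index (otherwise i == len(chunks) could never be true); should_emit and emit_points are hypothetical helper names used only for this illustration.

```python
from typing import List


def should_emit(i: int, total: int, every: int) -> bool:
    """Emit on every `every`-th item and always on the last one."""
    return i % every == 0 or i == total


def emit_points(total: int, every: int) -> List[int]:
    # 1-based indices at which a progress event would be emitted.
    return [i for i in range(1, total + 1) if should_emit(i, total, every)]
```

For example, emit_points(10, 4) gives [4, 8, 10] and emit_points(8, 4) gives [4, 8]. The final index is always included, so a separate emit after the loop would report the last state twice.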
